ABSTRACT

As probabilistic data management becomes one of the main research
focuses and keyword search becomes a more popular query means, it is
natural to ask how to support keyword queries on probabilistic XML
data. For keyword queries on deterministic XML documents, the ELCA
(Exclusive Lowest Common Ancestor) semantics allows more relevant
fragments rooted at the ELCAs to appear as results and is more
popular than other keyword query result semantics (such as SLCAs).

In this paper, we investigate how to evaluate ELCA results for
keyword queries on probabilistic XML documents. After defining
probabilistic ELCA semantics in terms of possible world semantics,
we propose an approach to compute ELCA probabilities without
generating possible worlds. We then develop an efficient stack-based
algorithm that finds all probabilistic ELCA results and their ELCA
probabilities for a given keyword query on a probabilistic XML
document. Finally, we experimentally evaluate the proposed ELCA
algorithm and compare it with its SLCA counterpart in terms of
result effectiveness, time and space efficiency, and scalability.

1. INTRODUCTION 

Uncertain data management is currently one of the main research
focuses in the database community. Uncertain data may arise for
different reasons, such as limited observation equipment,
unsupervised data integration, and conflicting feedback. Moreover,
uncertainty itself is inherent in nature. This drives practitioners
to face the reality and develop specific database solutions that
embrace the uncertain world. In many web applications, such as
information extraction, a lot of uncertain data are automatically
generated by crawlers or mining systems, and most of the time they
come from tree-like raw data. Consequently, it is natural to
organize the extracted information in a semi-structured way, with
probabilities attached to show the confidence in the collected
information. In addition, dependencies between pieces of extracted
information can be easily captured by the parent-child relationship
in a tree-like XML document. As a result, research on probabilistic
XML data management is extensively under way.

Many probabilistic models [1][2][3][4][5][6][7] have been proposed
to describe probabilistic XML data. The relative expressiveness of
the different models is discussed in [1]. Beyond modeling, querying
probabilistic XML data to retrieve useful information is of equal
importance. Current studies have mainly focused on twig queries
[8][9][10], with little light [11] shed on keyword queries on
probabilistic XML data. However, support for keyword search is
important and promising, because users are relieved from learning
complex query languages (such as XPath and XQuery) and are not
required to know the schema of the probabilistic XML document. A
user only needs to submit a few keywords, and the system
automatically finds suitable fragments in the probabilistic XML
document.

There are established works on keyword search over deterministic
XML data. One of the most popular semantics for modeling keyword
query results on a deterministic XML document is the ELCA (Exclusive
Lowest Common Ancestor) semantics [12][13][14]. We introduce the
ELCA semantics using an example; formal definitions are given in
Section 2. Fig. 1(a) shows an ordinary XML tree. Nodes
$\{a_1, a_2, a_3\}$ directly contain keyword $a$, and nodes
$\{b_1, b_2, b_3, b_4\}$ directly contain keyword $b$. Nodes
$\{x_1, x_2, x_4\}$ are considered ELCAs of keywords $a$ and $b$. An
ELCA is first of all an LCA, and after excluding all its children
that contain all the keywords, it must still contain all the
keywords. Node $x_2$ is an ELCA because, after excluding $x_1$,
which contains all the keywords, $x_2$ still has its own
contributors $a_1$ and $b_2$. Node $x_3$ is not an ELCA because,
after excluding $x_4$, $x_3$ covers only keyword $b$. Nodes $x_1$
and $x_4$ are also ELCAs because they contain both keywords and none
of their children contains all the keywords, so no child needs to be
excluded from $x_1$ or $x_4$. Another popular semantics is the SLCA
(Smallest LCA) semantics [15][16]. It asks for the LCAs that are not
ancestors of other LCAs. For example, nodes $x_1$ and $x_4$ are
SLCAs on the tree, but $x_2$ is not, because it is an ancestor of
$x_1$. It is not difficult to see that the ELCA result is a superset
of the SLCA result, so the ELCA semantics can provide more
interesting information to users. This motivates us to study the
ELCA semantics, particularly on a new type of data, probabilistic
XML data. Note that although the SLCA semantics has been studied on
probabilistic XML data in [11], that solution cannot be used for the
ELCA semantics, because the ELCA semantics is more complex than the
SLCA semantics.

Figure 1: Examples of ELCAs and A Probabilistic XML Tree

On a probabilistic XML document, nodes may or may not appear;
accordingly, a node is (usually) not certain to be an ELCA. As a
result, we want to find not only the possible ELCA nodes but also
their ELCA probabilities. Before pointing out the computational
challenge, we briefly introduce the probabilistic XML model used
throughout this paper. We consider a popular probabilistic XML
model, $PrXML^{\{ind,mux\}}$ [2][1], where a probabilistic XML
document (also called a p-document) is regarded as a tree with two
types of nodes: ordinary nodes and distributional nodes. Ordinary
nodes store the actual data, and distributional nodes define the
probability distribution over their child nodes. There are two types
of distributional nodes: IND and MUX. IND means the child nodes may
appear independently, and MUX means the child nodes are mutually
exclusive (i.e. only one child can appear among the defined
alternative children). A real number from (0, 1] is attached to each
edge in the XML tree, indicating the conditional probability that
the child node appears under the parent node given the existence of
the parent node. A randomly generated document from a p-document is
called a possible world. Each possible world has a probability, and
the probabilities of all possible worlds sum to 1. A probabilistic
XML tree is given in Fig. 1(b), where unweighted edges have the
default probability 1.
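To make the possible-world reading of a p-document concrete, the following is a minimal sketch that enumerates the outcomes of a tiny p-document and checks that their probabilities sum to 1. The dictionary layout, the helper names `node` and `forests`, and the example document are illustrative assumptions, not part of the model definition; different outcomes may in general yield the same deterministic document, which this sketch does not merge.

```python
from itertools import product
from math import isclose, prod


def node(kind, label=None, children=()):
    """Build a p-document node; children are (edge_probability, node) pairs."""
    return {"kind": kind, "label": label, "children": list(children)}


def forests(n):
    """List (probability, forest) outcomes for node n, given that n exists.

    A forest is the tuple of deterministic subtrees that n hands to its
    nearest ordinary ancestor; distributional nodes themselves vanish.
    """
    if n["kind"] == "MUX":
        # At most one child appears; the leftover mass means "no child".
        outcomes = []
        leftover = 1.0 - sum(lam for lam, _ in n["children"])
        if leftover > 1e-12:
            outcomes.append((leftover, ()))
        for lam, child in n["children"]:
            outcomes.extend((lam * p, f) for p, f in forests(child))
        return outcomes
    # ORD and IND: every child is decided independently of the others.
    per_child = []
    for lam, child in n["children"]:
        options = [(lam * p, f) for p, f in forests(child)]
        if lam < 1.0:
            options.append((1.0 - lam, ()))
        per_child.append(options)
    outcomes = []
    for combo in product(*per_child):
        probability = prod(p for p, _ in combo)
        forest = tuple(tree for _, f in combo for tree in f)
        if n["kind"] == "ORD":
            forest = ((n["label"], forest),)
        outcomes.append((probability, forest))
    return outcomes


# Hypothetical p-document: root r with one IND child and one MUX child.
doc = node("ORD", "r", [
    (1.0, node("IND", children=[(0.8, node("ORD", "a")),
                                (0.5, node("ORD", "b"))])),
    (1.0, node("MUX", children=[(0.3, node("ORD", "c")),
                                (0.6, node("ORD", "d"))])),
])
worlds = forests(doc)
assert isclose(sum(p for p, _ in worlds), 1.0)
```

The IND node contributes four outcomes and the MUX node three, so the enumeration grows multiplicatively with the number of distributional nodes.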

Given a keyword query and a p-document, a node may be an ELCA of the
keywords in some possible worlds but not in others. We cannot ignore
the distributional nodes, because the ELCA results on a
probabilistic XML tree may be totally different from those on a
deterministic XML tree. For example, in Fig. 1(b), $x_4$ is no
longer an ELCA due to the MUX semantics, and $x_3$ may become an
ELCA if a possible world contains $a_3$ but not $b_3$, whereas on
the deterministic version $x_3$ is never an ELCA node. Furthermore,
$x_1$, a 100% ELCA node in Fig. 1(a), becomes a conditional ELCA
with probability 0.8 × 0.6 × 0.7 in Fig. 1(b). $x_2$ also becomes an
80% ELCA node. As a result, deterministic ELCA solutions
[12][13][14] are not applicable to the new problem. Furthermore,
finding the possible ELCA nodes is not enough: users may also want
to know the ELCA probabilities of the possible ELCAs.

To solve the problem, a straightforward and safe method is to
generate all possible worlds from the given p-document, evaluate
ELCAs on each possible world using existing ELCA algorithms for
deterministic XML, and finally combine the results. However, this
method is infeasible, because the number of possible worlds is
exponential and the computation cost is therefore too high. The
challenge is how to evaluate the ELCA probability of a node using
only the p-document, without generating possible worlds. The idea of
our approach is to evaluate the ELCA probabilities in a bottom-up
manner.

We summarize the contributions of this paper as follows: 

• To the best of our knowledge, this is the first work that studies 
ELCA semantics on probabilistic XML data. 

• We have defined probabilistic ELCA semantics for keyword
search on probabilistic XML documents. We have proposed
an approach to evaluate ELCA probabilities without
generating possible worlds and have designed a stack-based
algorithm, the PrELCA algorithm, to find the probabilistic
ELCAs and their probabilities.

• We have conducted extensive experiments to test the result
effectiveness, time and space efficiency, and scalability of
the PrELCA algorithm.

The rest of this paper is organized as follows. In Section 2, we
introduce ELCA semantics on a deterministic XML document and define
probabilistic ELCA semantics on a probabilistic XML document. In
Section 3, we propose how to compute ELCA probabilities on a
probabilistic XML document without generating possible worlds. An
algorithm, PrELCA, is introduced in Section 4 to explain how to put
the conceptual idea of Section 3 into procedural computation steps.
We report the experiment results in Section 5. Related work and the
conclusion are in Section 6 and Section 7 respectively.

2. PRELIMINARIES 

In this section, we first introduce ELCA semantics on a deter- 
ministic XML document, and then define probabilistic ELCA se- 
mantics on a probabilistic XML document. 

2.1 ELCA Semantics on Deterministic XML 

A deterministic XML document is usually modeled as a labeled ordered
tree. Each node of the XML tree corresponds to an XML element, an
attribute or a text string. The leaf nodes are all text strings. A
keyword may appear in element names, attribute names or text
strings. If a keyword $k$ appears in the subtree rooted at a node
$v$, we say the node $v$ contains keyword $k$. If $k$ appears in the
element name or attribute name of $v$, or $k$ appears in the text
value of $v$ when $v$ is a text string, we say node $v$ directly
contains keyword $k$. A keyword query on a deterministic XML
document often asks for an XML node that contains all the keywords;
therefore, for large XML documents, indexes are often built to
record which nodes directly contain which keywords. For example, for
a keyword $k_i$, all nodes that directly contain $k_i$ are stored in
a list $S_i$ (called an inverted list) and can be retrieved together
at once.

We adopt the formalized ELCA semantics of [13]. We introduce some
notions first. Let $v \prec_a u$ denote that $v$ is an ancestor node
of $u$, and let $v \preceq_a u$ denote $v \prec_a u$ or $v = u$. The
function $lca(v_1, \ldots, v_n)$ computes the Lowest Common Ancestor
(LCA) of nodes $v_1, \ldots, v_n$. The LCA of sets
$S_1, \ldots, S_n$ is the set of LCAs for each combination of nodes
in $S_1$ through $S_n$:

$$lca(k_1, \ldots, k_n) = lca(S_1, \ldots, S_n) = \{\, lca(v_1, \ldots, v_n) \mid v_1 \in S_1, \ldots, v_n \in S_n \,\}$$

Given $n$ keywords $\{k_1, \ldots, k_n\}$ and their corresponding
inverted lists $S_1, \ldots, S_n$ of an XML tree $T$, the Exclusive
LCA of these keywords on $T$ is defined as:

$$elca(k_1, \ldots, k_n) = elca(S_1, \ldots, S_n) = \{\, v \mid \exists v_1 \in S_1, \ldots, v_n \in S_n \,(\, v = lca(v_1, \ldots, v_n) \wedge \forall i \in [1, n]\ \nexists x \,(\, x \in lca(S_1, \ldots, S_n) \wedge child(v, v_i) \preceq_a x \,)\,)\,\}$$

where $child(v, v_i)$ denotes the child node of $v$ on the path from
$v$ to $v_i$. The meaning of a node $v$ being an ELCA is: $v$ should
contain all the keywords in the subtree rooted at $v$, and after
excluding from the subtree those children of $v$ that also contain
all the keywords, the subtree still contains all the keywords. In
other words, for each keyword, node $v$ should have its own keyword
contributors.
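The exclusion rule can be checked mechanically on a small tree. The sketch below is a brute-force illustration, not an optimized ELCA algorithm: it computes, bottom-up, the keyword bitmask of every subtree and treats a child whose mask is full as screened. The tree is a hypothetical one built to match the verbal description of Fig. 1(a) (the exact shape of the figure is an assumption), and the function name `elca_and_slca` is illustrative.

```python
def elca_and_slca(children, direct, root, n):
    """Return (ELCA nodes, SLCA nodes) in postorder for an n-keyword query.

    children: node -> list of child nodes
    direct:   node -> bitmask of the keywords the node directly contains
    """
    full = (1 << n) - 1
    elcas, slcas = [], []

    def visit(v):
        contained = direct.get(v, 0)  # keywords anywhere under v
        own = direct.get(v, 0)        # keywords left after the exclusion
        has_full_child = False
        for c in children.get(v, []):
            mask = visit(c)
            contained |= mask
            if mask == full:
                has_full_child = True  # c screens its keywords from v
            else:
                own |= mask
        if own == full:
            elcas.append(v)
        if contained == full and not has_full_child:
            slcas.append(v)
        return contained

    visit(root)
    return elcas, slcas


A, B = 0b10, 0b01  # bit assigned to keyword a and to keyword b
# Hypothetical tree in the spirit of Fig. 1(a).
children = {
    "r": ["x2", "x3"],
    "x2": ["a1", "x1", "b2"],
    "x1": ["a2", "b1"],
    "x3": ["x4", "b4"],
    "x4": ["a3", "b3"],
}
direct = {"a1": A, "a2": A, "a3": A, "b1": B, "b2": B, "b3": B, "b4": B}
elcas, slcas = elca_and_slca(children, direct, "r", 2)
print(elcas)  # ['x1', 'x2', 'x4']
print(slcas)  # ['x1', 'x4']
```

On this tree the ELCA result is a strict superset of the SLCA result, as in the introductory example.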

2.2 ELCA Semantics on Probabilistic XML 

A probabilistic XML document (p-document) defines a proba- 
bility distribution over a space of deterministic XML documents. 
Each deterministic document belonging to this space is called a 
possible world. A p-document can be modelled as a labelled tree T 
with ordinary and distributional nodes. Ordinary nodes are regu- 
lar XML nodes that may appear in deterministic documents, while 
distributional nodes are used for describing a probabilistic process 
following which possible worlds can be generated. Distributional 
nodes do not occur in deterministic documents. 

We define ELCA semantics on a p-document with the help of the
possible worlds of the p-document. Given a p-document $T$ and a
keyword query $\{k_1, k_2, \ldots, k_n\}$, we define the
probabilistic ELCA of these keywords on $T$ as a set of node and
probability pairs $(v, Pr^G_{elca}(v))$. Each node $v$ is an ELCA
node in at least one possible world generated by $T$, and its
probability $Pr^G_{elca}(v)$ is the aggregated probability of all
possible worlds that have node $v$ as an ELCA. The formal definition
of $Pr^G_{elca}(v)$ is as follows:

$$Pr^G_{elca}(v) = \sum_{i=1}^{m} \{\, Pr(w_i) \mid elca(v, w_i) = true \,\} \tag{1}$$

where $\{w_1, \ldots, w_m\}$ denotes the set of possible worlds
implied by $T$, $elca(v, w_i) = true$ indicates that $v$ is an ELCA
in the possible world $w_i$, and $Pr(w_i)$ is the existence
probability of the possible world $w_i$.

To develop the above discussion, $Pr^G_{elca}(v)$ can also be
computed with Equation 2. Here, $Pr(path_{r \to v})$ indicates the
existence probability of $v$ in the possible worlds. It can be
computed by multiplying the conditional probabilities in $T$ along
the path from the root $r$ to node $v$. $Pr^L_{elca}(v)$ is the
local probability of $v$ being an ELCA in $T_{sub}(v)$, where
$T_{sub}(v)$ denotes the subtree of $T$ rooted at $v$.

$$Pr^G_{elca}(v) = Pr(path_{r \to v}) \times Pr^L_{elca}(v) \tag{2}$$

To compute $Pr^L_{elca}(v)$, we have the following equation, similar
to Equation 1:

$$Pr^L_{elca}(v) = \sum_{i=1}^{m'} \{\, Pr(t_i) \mid elca(v, t_i) = true \,\} \tag{3}$$

where the deterministic trees $\{t_1, t_2, \ldots, t_{m'}\}$ are the
local possible worlds generated from $T_{sub}(v)$, $Pr(t_i)$ is the
probability of generating $t_i$ from $T_{sub}(v)$, and
$elca(v, t_i) = true$ means $v$ is an ELCA node in $t_i$.

In the following sections, we mainly focus on how to compute the
local ELCA probability $Pr^L_{elca}(v)$ for a node $v$.
$Pr(path_{r \to v})$ is easy to obtain if we have an index recording
the probabilities from the root to node $v$. The global probability
$Pr^G_{elca}(v)$ then follows from Equation 2.



3. ELCA PROBABILITY COMPUTATION 

In this section, we introduce how to compute ELCA probabilities for
nodes on a p-document without generating possible worlds. We start
by introducing keyword distribution probabilities, and then show how
to compute the ELCA probability of a node $v$ using the keyword
distribution probabilities of $v$'s children.

3.1 Keyword Distribution Probabilities 

Given a keyword query $Q = \{k_1, \ldots, k_n\}$ with $n$ keywords,
for each node $v$ in the p-document $T$ we can assign an array
$tab_v$ of size $2^n$ to record the keyword distribution
probabilities under $v$. For example, let $\{k_1, k_2\}$ be a
keyword query. Entry $tab_v[11]$ records the probability that $v$
contains both $k_1$ and $k_2$ in the possible worlds produced by
$T_{sub}(v)$; similarly, $tab_v[01]$ stores the probability that $v$
contains only $k_2$; $tab_v[10]$ keeps the probability that $v$
contains only $k_1$; and $tab_v[00]$ records the probability that
neither $k_1$ nor $k_2$ appears under $v$. Note that the
probabilities stored in $tab_v$ are local probabilities, i.e. they
are conditioned on node $v$ existing in the possible worlds produced
by $T$. To implement $tab_v$, we only need to store its non-zero
entries, using a HashMap to save space; for clarity of discussion,
however, we describe $tab_v$ as an array with $2^n$ entries.

For a leaf node $v$ in $T$, the entries of $tab_v$ are either 1 or
0. Precisely speaking, one entry is 1 and all the other entries are
0. When $v$ is an internal node, let $v$'s children be
$\{c_1, \ldots, c_m\}$ and let $\lambda_i$ be the conditional
probability that $c_i$ appears under $v$; then $tab_v$ can be
computed using $\{tab_{c_1}, \ldots, tab_{c_m}\}$ and
$\{\lambda_1, \ldots, \lambda_m\}$. We elaborate the computation for
the different types of $v$: ordinary nodes, MUX nodes and IND nodes.

3.1.1 Node v is an Ordinary node

When $v$ is an ordinary node, all the children of $v$ will
definitely appear under $v$, so we have
$\lambda_1 = \ldots = \lambda_m = 1$. Let $tab_v[\mu]$ be an entry
in $tab_v$, where $\mu$ is a binary expression of the entry index
(e.g. $tab_v[101]$ refers to $tab_v[5]$, here $\mu$ = "101"); then
$tab_v[\mu]$ can be computed using the following equation:

$$tab_v[\mu] = \sum_{\mu = \mu_1 \vee \ldots \vee \mu_m} \ \prod_{i=1}^{m} tab_{c_i}[\mu_i] \tag{4}$$

Here, $tab_{c_i}[\mu_i]$ is an entry in $tab_{c_i}$, $\mu_i$ gives
the keyword occurrences under $v$'s child $c_i$, and
$\mu_1 \vee \ldots \vee \mu_m$ gives the keyword occurrences among
all of $v$'s children. Different $\{\mu_1, \ldots, \mu_m\}$
combinations may produce the same $\mu$, so the total probability of
these combinations gives $tab_v[\mu]$.

Fig. 2(a) shows an example, where $v$ is an ordinary node and
$c_1$, $c_2$ are $v$'s children; $v$'s keyword distribution table
can be computed from the keyword distribution tables of $c_1$ and
$c_2$. Take entry $tab_v[01]$, denoted as $p_2$, as an example:
$p_2$ stands for the case that $v$ contains keyword $k_2$ but does
not contain $k_1$. It correspondingly implies three cases: (1) $c_1$
contains $k_2$ and $c_2$ contains neither $k_1$ nor $k_2$; (2) $c_2$
contains $k_2$ and $c_1$ contains neither; (3) both $c_1$ and $c_2$
contain only keyword $k_2$. The sum of the probabilities of the
three cases gives the local probability $p_2$.

The naive way to compute $tab_v$ based on Equation 4 results in an
$O(m 2^{nm})$ algorithm, because each $tab_{c_i}$ contains $2^n$
entries and there are $(2^n)^m = 2^{nm}$ combinations of
$\{\mu_1, \ldots, \mu_m\}$. For each combination, computing
$\prod_{i=1}^{m} tab_{c_i}[\mu_i]$ takes $O(m)$ time. However, we
can compute $tab_v$ progressively in $O(m 2^{2n})$ time. The idea is
to use an intermediate array $tab'_v$ to record a temporary
distribution and then combine the intermediate array $tab'_v$ with
each $tab_{c_i}$ one by one (not all together). We now illustrate
the process: at the beginning, $tab'_v$ is initialized using
Equation 5; then each $tab_{c_i}$ ($i \in [1, m]$) is merged with
$tab'_v$ based on Equation 6; in the end, after incorporating all of
$v$'s children, $tab_v$ is set to $tab'_v$.

Figure 2: Evaluation of Keyword Distribution Table



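$$tab'_v[\mu] = \begin{cases} 0 & (\mu \neq 00\ldots0) \\ 1 & (\mu = 00\ldots0) \end{cases} \tag{5}$$

$$tab'_v[\mu] \leftarrow \sum_{\mu = \mu' \vee \mu_i} tab'_v[\mu'] \cdot tab_{c_i}[\mu_i] \tag{6}$$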
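The progressive computation for an ordinary node can be sketched as follows. This is a minimal illustration using dense lists indexed by integer bitmask rather than a hash map; the function names `init_table`, `merge_or` and `ordinary_table` and the example tables are assumptions made for the sketch. How bits are assigned to keywords is only a convention and does not affect the OR-merge.

```python
from math import isclose


def init_table(n):
    """Start state: with no child merged yet, no keyword has been seen."""
    tab = [0.0] * (1 << n)
    tab[0] = 1.0
    return tab


def merge_or(acc, child):
    """Merge one child: every pair of cases lands on the OR of the two masks."""
    out = [0.0] * len(acc)
    for mu_acc, p_acc in enumerate(acc):
        if p_acc == 0.0:
            continue  # skip empty cases, as a sparse table would
        for mu_child, p_child in enumerate(child):
            out[mu_acc | mu_child] += p_acc * p_child
    return out


def ordinary_table(child_tables, n):
    """Keyword distribution table of an ordinary node from its children."""
    tab = init_table(n)
    for child in child_tables:  # one child at a time, not all together
        tab = merge_or(tab, child)
    return tab


# Two hypothetical children for a two-keyword query.
tab_c1 = [0.5, 0.5, 0.0, 0.0]
tab_c2 = [0.2, 0.3, 0.5, 0.0]
tab_v = ordinary_table([tab_c1, tab_c2], n=2)
assert isclose(sum(tab_v), 1.0)
```

Each merge touches at most $2^n \times 2^n$ pairs, which is where the $O(m 2^{2n})$ bound for $m$ children comes from.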

Generally speaking, we expect a keyword query to consist of fewer
than five keywords ($n < 5$), so $2^{2n}$ is not very large;
besides, the number of non-zero entries is much smaller than the
theoretical bound $2^n$. We can therefore consider the complexity of
computing $tab_v$ as $O(m)$. When there are too many keywords, a
situation outside the scope of this paper, some preprocessing
techniques may be adopted to cut down the number of keywords, such
as grouping a few keywords into one phrase. In this paper, we
restrict our discussion to queries with only a few keywords.

3.1.2 Node v is an MUX node 

For an MUX node, Equation 7 shows how to compute $tab_v[\mu]$ under
mutually-exclusive semantics. A keyword distribution $\mu$ appearing
at node $v$ implies that $\mu$ appears at one of $v$'s children, and
thus $\sum_{i=1}^{m} \lambda_i \cdot tab_{c_i}[\mu]$ gives
$tab_v[\mu]$. The case $\mu$ = "00...0" is treated specially. An
example is given in Fig. 2(b): consider node $v$ as an MUX node now;
then $tab_v[01]$ can be computed as
$\lambda_1 \cdot tab_{c_1}[01] + \lambda_2 \cdot tab_{c_2}[01]$. In
contrast, the entry $tab_v[00]$ includes an extra
$(1 - \lambda_1 - \lambda_2)$ component, because the absence of both
$c_1$ and $c_2$ also implies that node $v$ does not contain any
keywords.
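Written as a two-case formula, the rule just described reads as follows. This is a sketch assembled from the prose and the Fig. 2(b) example; it uses only symbols already introduced ($\lambda_i$ for the edge probability of child $c_i$, $m$ for the number of children).

```latex
\[
tab_v[\mu] =
\begin{cases}
  \sum_{i=1}^{m} \lambda_i \cdot tab_{c_i}[\mu]
    & (\mu \neq 00\ldots0) \\[4pt]
  \sum_{i=1}^{m} \lambda_i \cdot tab_{c_i}[\mu]
    \;+\; 1 - \sum_{i=1}^{m} \lambda_i
    & (\mu = 00\ldots0)
\end{cases}
\]
```

For $m = 2$ the second case reduces to the extra $(1 - \lambda_1 - \lambda_2)$ component mentioned for $tab_v[00]$.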
Similar to the ordinary case, $tab_v$ can be progressively computed
under mutually-exclusive semantics as well. At the beginning, table
$tab'_v$ is initialized using Equation 5, the same as in the
ordinary case; then $tab'_v$ is increased by merging with each
$tab_{c_i}$ ($i \in [1, m]$) progressively using Equation 8. In the
end, $tab_v$ is set to $tab'_v$. Both the straightforward and the
progressive methods have $O(m 2^n)$ complexity.
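$$tab'_v[\mu] \leftarrow \begin{cases} tab'_v[\mu] + \lambda_i \cdot tab_{c_i}[\mu] & (\mu \neq 00\ldots0) \\ tab'_v[\mu] + \lambda_i \cdot tab_{c_i}[\mu] - \lambda_i & (\mu = 00\ldots0) \end{cases} \tag{8}$$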
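The progressive update for an MUX node can be sketched in a few lines. The version below is a minimal illustration under the assumption that the table starts from the same initial state as the ordinary case (all mass on the empty case) and that each child moves its share $\lambda_i$ of that mass to its own cases; the function name `mux_table` and the example numbers are hypothetical.

```python
from math import isclose


def mux_table(children, n):
    """Keyword distribution table of a MUX node.

    children: list of (lam, table) pairs, where lam is the edge probability
    of the child and table is its keyword distribution table.
    """
    tab = [0.0] * (1 << n)
    tab[0] = 1.0  # before any child is merged, nothing is contained
    for lam, child in children:
        for mu, p in enumerate(child):
            tab[mu] += lam * p  # the child is chosen with probability lam
        tab[0] -= lam           # that mass no longer means "no child at all"
    return tab


# Hypothetical example: two leaf children, one per keyword.
tab_c1 = [0.0, 1.0, 0.0, 0.0]
tab_c2 = [0.0, 0.0, 1.0, 0.0]
tab_v = mux_table([(0.3, tab_c1), (0.6, tab_c2)], n=2)
assert isclose(sum(tab_v), 1.0)
```

Because the children are mutually exclusive, no entry combines keywords from two different children: the last entry stays 0 here, and each child costs only one pass over its $2^n$ entries.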

3.1.3 Node v is an IND node

When $v$ is an IND node, the computation of $tab_v$ is similar to
the ordinary case. Before directly applying Equations 5 and 6, we
need to standardize the keyword distribution tables. The idea is to
transform each edge probability $\lambda_i$ into 1 and make
corresponding changes to the keyword distribution table with no side
effects. The modification is based on Equation 9. An example is
shown in Fig. 2(c): $c_1$ and $c_2$ are two children of IND node $v$
with probabilities $\lambda_1$, $\lambda_2$; we can equivalently
transform the keyword distribution tables into the ones on the right
and change the probabilities on the edges into 1. The
$tab_{c_1}[00]$ and $tab_{c_2}[00]$ fields have $(1 - \lambda_1)$
and $(1 - \lambda_2)$ components, because the absence of a child
also implies that no keyword instances could appear under that
child. After the transformation, we can compute $v$'s keyword
distribution table from the transformed keyword distribution tables
of $c_1$ and $c_2$ in the same way as in Section 3.1.1.

Figure 3: Comparison of keyword distribution probability and ELCA probability



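$$tab_{c_i}[\mu] \leftarrow \begin{cases} \lambda_i \cdot tab_{c_i}[\mu] & (\mu \neq 00\ldots0) \\ \lambda_i \cdot tab_{c_i}[\mu] + 1 - \lambda_i & (\mu = 00\ldots0) \end{cases} \tag{9}$$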
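The standardization step followed by the ordinary OR-merge can be sketched as below. The helper names `standardize`, `merge_or` and `ind_table` are illustrative assumptions; the example uses two leaf children with edge probabilities 0.6 and 0.7.

```python
from math import isclose


def standardize(child, lam):
    """Fold the edge probability lam into the child's table."""
    out = [lam * p for p in child]
    out[0] += 1.0 - lam  # an absent child contributes no keyword
    return out


def merge_or(acc, child):
    """OR-merge of two tables, as for an ordinary node."""
    out = [0.0] * len(acc)
    for mu_acc, p_acc in enumerate(acc):
        for mu_child, p_child in enumerate(child):
            out[mu_acc | mu_child] += p_acc * p_child
    return out


def ind_table(children, n):
    """Keyword distribution table of an IND node from (lam, table) pairs."""
    tab = [0.0] * (1 << n)
    tab[0] = 1.0
    for lam, child in children:
        tab = merge_or(tab, standardize(child, lam))
    return tab


# Two leaves: one holds the first keyword, the other the second keyword.
leaf_a = [0.0, 1.0, 0.0, 0.0]
leaf_b = [0.0, 0.0, 1.0, 0.0]
tab_v = ind_table([(0.6, leaf_a), (0.7, leaf_b)], n=2)
assert isclose(sum(tab_v), 1.0)
```

With these inputs the standardized tables are approximately (0.4, 0.6, 0, 0) and (0.3, 0, 0.7, 0), and their merge is approximately (0.12, 0.18, 0.28, 0.42).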



In summary, we can obtain keyword distribution probabilities for
every node in the p-document. The computation can be done
progressively in a bottom-up manner. In the next section, we show
how to obtain the ELCA probability of a node $v$ using the keyword
distribution probabilities of $v$'s children.

3.2 ELCA Probability 

We consider ELCA nodes to be ordinary nodes only. We first point out
two cases in which we do not need to compute the ELCA probability or
can simply reuse the ELCA probability of a child node; after that,
we discuss when we need to compute ELCA probabilities and how to do
it using keyword distribution tables.

Case 1: $v$ is an ordinary node, and $v$ has a distributional node
as its single child. In this case, we do not need to compute the
ELCA probability for $v$, because the child distributional node
passes its probability upward to $v$.

Case 2: $v$ is an MUX node and has ELCA probability 0. According to
the MUX semantics, at most one child of $v$ appears in a possible
world. If that child does not contain all the keywords, then $v$
does not contain all the keywords either; on the other hand, if the
child contains all the keywords, the child screens the keywords from
contributing upward, so node $v$ still has no keyword contributors
of its own. In both situations, $v$ is not regarded as an ELCA.

In the other cases, i.e. when $v$ is an ordinary or IND node and $v$
has a set of ordinary nodes as children, we need to compute the ELCA
probability for $v$. Note that when $v$ is an IND node, although $v$
cannot be considered as an ELCA result, we still compute its ELCA
probability, because this probability will be passed to $v$'s parent
according to Case 1. We discuss the ordinary node case first; the
IND node case is similar. We first define a concept, the
contributing distribution, to help present the idea.

Definition 1. Let $\mu$ be a binary expression of an entry index
representing a keyword-distribution case. We define $\bar{\mu}$ as
the contributing distribution of $\mu$, with the value as follows:

$$\bar{\mu} = \begin{cases} \mu & (\mu \neq 11\ldots1) \\ 00\ldots0 & (\mu = 11\ldots1) \end{cases} \tag{10}$$

It means that $\bar{\mu}$ remains the same as $\mu$ in most cases,
except that when $\mu$ is "11...1", $\bar{\mu}$ is set to "00...0".
According to ELCA semantics, if a child $c_i$ of node $v$ contains
all the keywords, $c_i$ screens the keyword instances from
contributing upward to its parent $v$. This is our motivation for
defining $\bar{\mu}$. That is to say: when $\mu_i$ is "11...1", we
regard the contributing distribution $\bar{\mu}_i$ of $\mu_i$ (to
the parent node $v$) as "00...0".

For an ordinary node $v$, let $\{c_1, \ldots, c_m\}$ be $v$'s
children and $\{tab_{c_1}, \ldots, tab_{c_m}\}$ be the keyword
distribution probability arrays of $\{c_1, \ldots, c_m\}$
respectively, and let $\bar{\mu}_i$ be the contributing distribution
corresponding to $\mu_i$. Equation 11 gives how to compute the local
ELCA probability $Pr^L_{elca}(v)$ for node $v$ using
$\{tab_{c_1}, \ldots, tab_{c_m}\}$.

$$Pr^L_{elca}(v) = \sum_{11\ldots1 = \bar{\mu}_1 \vee \ldots \vee \bar{\mu}_m} \ \prod_{i=1}^{m} tab_{c_i}[\mu_i] \tag{11}$$

To explain Equation 11: $v$ is an ELCA when the disjunction of
$\bar{\mu}_1, \ldots, \bar{\mu}_m$ is "11...1", which means that
after excluding all the children of $v$ that contain all the
keywords, $v$ still contains all the keywords under its other
children. All such $\{\mu_1, \ldots, \mu_m\}$ combinations
contribute to $Pr^L_{elca}(v)$, and hence the right-hand side of
Equation 11 gives an intuitive way to compute $Pr^L_{elca}(v)$.

Similar to keyword distribution probabilities, we can compute
$Pr^L_{elca}(v)$ in a progressive way, reducing the computation
complexity from $O(m 2^{nm})$ to $O(m 2^{2n})$. An intermediate
array of size $2^n$ is used, denoted as $tab''_v$. Here, the
function of $tab''_v$ is similar to that of $tab'_v$ used in the
last section. To be specific, at the beginning, $tab''_v$ is
initialized by Equation 12. As the computation goes on, $tab''_v$ is
continuously merged with $tab_{c_i}$ ($i \in [1, m]$) using Equation
13. In the end, after merging the intermediate table with all of
$v$'s children one by one, entry $tab''_v[11\ldots1]$ gives
$Pr^L_{elca}(v)$. Note that although only one entry of $tab''_v$,
namely $tab''_v[11\ldots1]$, is required as the final result, we
need to store the whole table $tab''_v$ during the computation,
because the other entries are used to compute the final
$tab''_v[11\ldots1]$ entry.
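$$tab''_v[\mu] = \begin{cases} 0 & (\mu \neq 00\ldots0) \\ 1 & (\mu = 00\ldots0) \end{cases} \tag{12}$$

$$tab''_v[\mu] \leftarrow \sum_{\mu = \mu'' \vee \bar{\mu}_i} tab''_v[\mu''] \cdot tab_{c_i}[\mu_i] \tag{13}$$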



For each child $c_i$, when we compute $Pr^L_{elca}(v)$, the array
entry $tab_{c_i}[11\ldots1]$ acts the same as the entry
$tab_{c_i}[00\ldots0]$, because it does not contribute any keyword
to its parent. Consequently, we can first modify $tab_{c_i}$ with
Equation 14 and then reuse Equation 6 to compute $Pr^L_{elca}(v)$.

$$tab_{c_i}[\mu] \leftarrow \begin{cases} tab_{c_i}[00\ldots0] + tab_{c_i}[11\ldots1] & (\mu = 00\ldots0) \\ 0 & (\mu = 11\ldots1) \\ tab_{c_i}[\mu] & \text{otherwise} \end{cases} \tag{14}$$

For an IND node $v$, we can standardize the keyword distribution
tables using Equation 9. Then, the computation is the same as in the
ordinary node case.

In Fig. 3(b), we give an example to show how to compute the
intermediate table $tab''_v$. An ordinary node $v$ has two children
$c_1$, $c_2$. Their keyword distribution tables have been modified
according to Equation 14. The probability of $v$ containing both
keywords (coming from different children) is given by
$p_4 = x_2 \cdot y_3 + x_3 \cdot y_2$, which covers two cases: (1)
$c_1$ contains $k_1$ and $c_2$ contains $k_2$; (2) $c_1$ contains
$k_2$ and $c_2$ contains $k_1$. Neither $c_1$ nor $c_2$ is allowed
to contain both keywords on its own, because in ELCA semantics a
node that contains all the keywords makes no contribution to its
parent. This probability is smaller than the probability
$p_4 = x_4 + y_4 + x_2 \cdot y_3 + x_3 \cdot y_2$ (given in
Fig. 3(a)), which is the keyword distribution probability that node
$v$ contains both keywords, not necessarily from different children.
Similarly, the calculations of $tab'_v[01]$ and $tab''_v[01]$ (i.e.
$p_2$) are also different.
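The screening-then-merging computation of the local ELCA probability for an ordinary node can be sketched as follows. The helper names `screen`, `merge_or` and `local_elca_probability` are illustrative, and the two child tables are hypothetical numbers chosen only so that the result can be checked by hand against the cross terms $x_2 \cdot y_3 + x_3 \cdot y_2$.

```python
from math import isclose


def screen(child):
    """Move the 'contains everything' mass of a child onto the empty case."""
    out = list(child)
    out[0] += out[-1]
    out[-1] = 0.0
    return out


def merge_or(acc, child):
    """OR-merge of two tables, as for keyword distribution tables."""
    out = [0.0] * len(acc)
    for mu_acc, p_acc in enumerate(acc):
        for mu_child, p_child in enumerate(child):
            out[mu_acc | mu_child] += p_acc * p_child
    return out


def local_elca_probability(child_tables, n):
    """Probability that the children jointly contribute all n keywords
    without any single child containing all of them."""
    tab = [0.0] * (1 << n)
    tab[0] = 1.0
    for child in child_tables:
        tab = merge_or(tab, screen(child))
    return tab[-1]


# Hypothetical children (x1..x4) and (y1..y4) for two keywords.
x = [0.1, 0.2, 0.3, 0.4]
y = [0.4, 0.3, 0.2, 0.1]
p_elca = local_elca_probability([x, y], n=2)
assert isclose(p_elca, x[1] * y[2] + x[2] * y[1])
```

Only the last entry is returned, but the loop keeps the whole intermediate table, since the partial cases of earlier children are needed when later children are merged.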

4. ALGORITHM 

In this section, we introduce an algorithm, PrELCA, that puts the
conceptual idea of the previous section into procedural computation
steps. We start with indexing probabilistic XML data and then
introduce the PrELCA algorithm. In the end, we discuss why it is
hard to find effective upper bounds for ELCA probabilities, which
suggests that the PrELCA algorithm may be the only acceptable
solution.

4.1 Indexing Probabilistic XML Data 

We use the Dewey Encoding Scheme [18] to encode the probabilistic
XML document. With a small trick, we can encode edge probabilities
into the Dewey code and save some space. We illustrate the idea
using Fig. 4. 1.3.6.9 is the Dewey code of node $x_4$, and
0.9->1->0.7 are the probabilities on the path from the root to node
$x_4$. To assist with the encoding, we add a dummy probability 1
before 0.9 and get the probability path 1->0.9->1->0.7. By
performing an addition, the Dewey code and the probabilities can be
combined and stored together as 2->3.9->7->9.7. We call this code
the pDewey code. For each field $y$ in the combined pDewey code, the
corresponding Dewey component can be decoded as
$\lceil y \rceil - 1$, and the probability can be decoded as
$y + 1 - \lceil y \rceil$. Correctness is guaranteed because edge
probabilities always belong to (0, 1]. Apparently, this encoding
trick trades time for space.
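The encoding and decoding rules can be sketched directly. The function names `encode_pdewey` and `decode_pdewey` are illustrative, and `Decimal` is used here only so that the round trip on the example is exact; the storage format of an actual index is not specified by this sketch.

```python
from decimal import Decimal
from math import ceil


def encode_pdewey(dewey, probs):
    """Add each edge probability in (0, 1] to its Dewey component."""
    return [Decimal(d) + p for d, p in zip(dewey, probs)]


def decode_pdewey(fields):
    """Recover (Dewey components, edge probabilities) from pDewey fields."""
    dewey = [ceil(y) - 1 for y in fields]
    probs = [y + 1 - ceil(y) for y in fields]
    return dewey, probs


# The example from the text: Dewey code 1.3.6.9 with probabilities
# 1 -> 0.9 -> 1 -> 0.7 (the leading 1 is the dummy root probability).
dewey = [1, 3, 6, 9]
probs = [Decimal("1"), Decimal("0.9"), Decimal("1"), Decimal("0.7")]
fields = encode_pdewey(dewey, probs)
print([str(y) for y in fields])  # ['2', '3.9', '7', '9.7']
assert decode_pdewey(fields) == (dewey, probs)
```

The ceiling is what makes probability 1 decodable: a field such as 2 decodes to component 1 with probability 1, whereas a floor-based rule would misread it as component 2.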
Algorithm 1 PrELCA Algorithm

Input: inverted lists of all keywords, S
Output: a set R of (r[], f) pairs, where r[] is a node (represented
        by its Dewey code) and f is the ELCA probability of the node

 1: result set R := ∅;
 2: stack := empty;
 3: while not end of S do
 4:     Read a new node v from S according to Dewey order, let
        array v[] record its Dewey code;
 5:     p := lcp(stack, v); {find the longest common prefix p such
        that stack[i].node = v[i], 1 ≤ i ≤ p}
 6:     while stack.size > p do
 7:         let r[] be the Dewey code in the current stack;
 8:         let f = stack.top().elcaTbl[11...1];
 9:         add (r[], f) into the result set R;
10:         popEntry = stack.pop();
11:         merge popEntry.disTbl[] into stack.top().disTbl[];
12:         calculate a new stack.top().elcaTbl[] using the previous
            stack.top().elcaTbl[] and popEntry.disTbl[];
13:     end while
14:     for p < j ≤ v.length do
15:         disTbl[] = new disTable();
16:         elcaTbl[] = new elcaTable();
17:         newEntry = (node := v[j]; disTbl[]; elcaTbl[]);
18:         stack.push(newEntry);
19:     end for
20: end while
21: while stack is not empty do
22:     Repeat line 7 to line 13;
23: end while



For each keyword, we store a list of the nodes that directly contain
that keyword using a B+-tree. The nodes are identified by their
pDewey codes. For each node, we also store the node types (ORD, IND,
MUX) on the path from the root to the current node. This node type
vector helps to perform the different types of calculation required
for the different distribution types. For simplicity, we use the
traditional Dewey code and omit pDewey code decoding when we
introduce the PrELCA algorithm in the next section.

4.2 PrELCA Algorithm 

According to the probabilistic ELCA semantics (Equation 1) defined
in Section 2, a node with non-zero ELCA probability must contain all
the keywords in some possible worlds. Therefore, all nodes in the
keyword inverted lists and the ancestors of these nodes constitute a
candidate ELCA set. The idea of the PrELCA algorithm is to mimic a
postorder traversal of the original p-document using only the
inverted lists. This can be realized by maintaining a stack. We
choose to mimic a postorder traversal because it has the feature
that a parent node is always visited after all its children have
been visited. This feature exactly fits the conceptual way of
computing ELCA probabilities in Section 3. By scanning all the
inverted lists once, the PrELCA algorithm can find all nodes with
non-zero ELCA probabilities without generating possible worlds.
Algorithm 1 gives the procedural steps. We first go through the
steps, and then give a running example to illustrate the algorithm.

The PrELCA algorithm takes keyword inverted lists as input, and
outputs all probabilistic ELCA nodes with their ELCA probabilities.
The memory cost is a stack. Each entry of the stack contains the
following information: (1) a visited node v, including the last
number of v's Dewey code (e.g., 3 is recorded if 1.2.3 is the Dewey
code of v) and the type of the node v; (2) an intermediate keyword
distribution table of v, denoted as disTbl[]; (3) an intermediate
ELCA probability table of v, denoted as elcaTbl[]. At the beginning,
the result set and the stack are initialized as empty (lines 1 and 2).
For each new node read from the inverted lists (lines 3-20), the
algorithm pops some nodes whose descendant nodes will not be seen
in the future and outputs their ELCA probabilities (lines 6-13), and
pushes some new nodes onto the stack (lines 14-19). Line 5
calculates how many nodes need to be popped from the stack by
finding the longest common prefix between the stack and the Dewey
code of the new node. Lines 7-9 output a result. After that, the top
entry is popped (line 10), its keyword distribution table is merged
into the new top entry (which records the parent node of the popped
node) based on Equations 6 and 8 at line 11, and the new top entry's
ELCA probability table is recalculated based on Equation 13 at
line 12. For each newly pushed node, its keyword distribution table
disTbl[] and its ELCA probability table elcaTbl[] are initialized
using Equation 5 and Equation 12 at lines 15 and 16, respectively.
Line 17 constructs a stack entry and line 18 pushes the new entry
onto the stack. After we finish reading the inverted lists, the
remaining nodes in the stack are popped and checked (lines 21-23).

Figure 5: Stack status for some steps of running PrELCA algorithm on the probabilistic XML tree in Fig. 2(b)
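
The pop-and-push bookkeeping described above can be sketched in a few
lines of Python. This is only an illustration of how a sorted scan of
Dewey codes reproduces a postorder traversal, not a transcription of
Algorithm 1: the names common_prefix_len and postorder_from_dewey are
hypothetical, Dewey codes are assumed to be tuples of integers sorted in
document order, and the probability tables (disTbl[], elcaTbl[]) are
deliberately left out.

```python
from typing import Iterable, List, Tuple

Dewey = Tuple[int, ...]


def common_prefix_len(stack: List[int], dewey: Dewey) -> int:
    """Length of the longest common prefix of the stack path and a Dewey code."""
    n = 0
    while n < len(stack) and n < len(dewey) and stack[n] == dewey[n]:
        n += 1
    return n


def postorder_from_dewey(codes: Iterable[Dewey]) -> List[Dewey]:
    """Return the candidate nodes (keyword nodes and their ancestors) in the
    order in which they leave the stack, which is a postorder."""
    stack: List[int] = []            # stack[i] = last Dewey number at depth i
    popped: List[Dewey] = []
    for dewey in codes:              # codes are assumed sorted in document order
        keep = common_prefix_len(stack, dewey)
        while len(stack) > keep:     # these nodes have no unseen descendants
            popped.append(tuple(stack))
            stack.pop()
        stack.extend(dewey[keep:])   # push the new node and its missing ancestors
    while stack:                     # flush the remaining nodes at the end
        popped.append(tuple(stack))
        stack.pop()
    return popped
```

In a full implementation, each pop would also merge the popped entry's
tables into its parent and report the node if its ELCA probability is
non-zero; here only the traversal order is shown.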

In Fig. 5, we show some snapshots for running the PrELCA algorithm
on the probabilistic XML tree in Fig. 2(b). At the beginning
(step 1), the first keyword instance a1 is read. All the ancestors of
a1 are pushed onto the stack, with the corresponding disTbl[] and
elcaTbl[] fields initialized. In step 2, a2 is read according to the
order of Dewey code. The longest common prefix between the stack
and the Dewey code of a2 is r.IND1.x2. So a1 is popped, and
x2's disTbl[] and elcaTbl[] are updated into (0, 1, 0, 0) and
(0, 1, 0, 0) by merging with a1's disTbl[]. Node a1 is not a result,
because a1's elcaTbl[11] is 0. Then, nodes IND2 and a2 are
pushed onto the stack. In step 3, b1 is read. Similar to step 2, a2
is popped with IND2's disTbl[] updated, and then b1 is pushed onto
the stack. In step 4, we read a new node b2 from the inverted lists.
Node b1 is first popped out of the stack. IND2's disTbl[] is updated
into (0.12, 0.18, 0.28, 0.42) by merging b1's disTbl[]
(0.3, 0, 0.7, 0) with IND2's current disTbl[] (0.4, 0.6, 0, 0).
Readers may feel free to verify the computation. Similarly, x1's
disTbl[] is updated as (0.12, 0.18, 0.28, 0.42) when IND2 is popped
out. x1's elcaTbl[] is set as IND2's elcaTbl[], because IND2 is a
single child distributional node of x1 and thus it does not screen
keywords from contributing upwards. (Recall Case 1 in Section 3.2.)
When x1 is popped out, we find that x1's elcaTbl[11] is non-zero.
Therefore, x1 has local ELCA probability Pr^L_elca(x1) = 0.42. The
global ELCA probability for x1 can be obtained by multiplying 0.42
with the edge probabilities along the path from the root r to x1. In
this example, the global ELCA probability is
Pr^G_elca(x1) = 0.42 * 0.8 = 0.336. An interesting scene takes place
when x1 is popped out of the stack: x2's disTbl[] is updated
accordingly as (0, 0.3, 0, 0.7) and x2's elcaTbl[] is updated as
(0, 0.72, 0, 0.28). For the first time during the process, x2's
elcaTbl[] is updated into a different value from its disTbl[]. The
reason is that x1 has screened keywords a and b from contributing
upwards when x1 itself already contains both keywords. So the local
probability that x2 contains both keywords, represented by x2's
disTbl[11], is 0.7, but the local ELCA probability of x2, represented
by x2's elcaTbl[11], is only 0.28. At last, b2 is pushed onto the
stack.
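
The numbers in the running example can be checked with a short Python
sketch. It is a hedged illustration of the arithmetic only, not a
restatement of Equations 6, 8 and 13: it assumes that a table is a list
indexed by keyword bitmask (00, 01, 10, 11), that the tables being merged
describe independent contributions and already include the edge
probability of the child, and that a child containing all keywords
contributes nothing upwards in the ELCA table. The helper names merge_or
and screen_full are hypothetical, and MUX nodes are not covered.

```python
from typing import List

Table = List[float]  # indexed by keyword bitmask: 0b00, 0b01, 0b10, 0b11


def merge_or(parent: Table, child: Table) -> Table:
    """Combine two independent keyword distributions: bitmasks are OR-ed
    and the corresponding probabilities are multiplied."""
    out = [0.0] * len(parent)
    for i, p in enumerate(parent):
        for j, q in enumerate(child):
            out[i | j] += p * q
    return out


def screen_full(child: Table) -> Table:
    """Move the mass of the 'all keywords' entry to the 'no keyword' entry,
    modelling a child that already contains every keyword and therefore
    contributes nothing upwards."""
    out = list(child)
    out[0] += out[-1]
    out[-1] = 0.0
    return out


# Step 4: b1's table is merged into IND2's current table.
ind2 = merge_or([0.4, 0.6, 0.0, 0.0], [0.3, 0.0, 0.7, 0.0])

# x1 inherits IND2's tables; when x1 is popped it is merged into x2,
# whose disTbl[] and elcaTbl[] are both (0, 1, 0, 0) at that point.
x1 = ind2
x2_dis = merge_or([0.0, 1.0, 0.0, 0.0], x1)
x2_elca = merge_or([0.0, 1.0, 0.0, 0.0], screen_full(x1))

# Global ELCA probability of x1: local value times the edge probability 0.8.
x1_global = x1[0b11] * 0.8
```

Up to floating-point rounding, ind2 is (0.12, 0.18, 0.28, 0.42), x2_dis
is (0, 0.3, 0, 0.7), x2_elca is (0, 0.72, 0, 0.28), and x1_global is
0.336, matching the values quoted in the example.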

4.3 No Early Stop 

In this subsection, we explain why we need to access every keyword
inverted list once in full, and why it is unlikely that an algorithm
can stop earlier. We use the example shown in Fig. 6 to illustrate
the idea. Readers can verify that node v indeed has ELCA probability
1, i.e. node v is 100% an ELCA node, but this result remains unknown
while we are examining the preceding subtrees T1, T2, etc. One may
want to access the nodes in the order of probability values, but this
does not change the fact that ELCA probability is always increasing
according to Equation 13. Furthermore, such algorithms may need to
access the inverted lists multiple times, which offers no advantage
over the current PrELCA algorithm.



Figure 6: Node v is 100% an ELCA node, but cannot be discovered
until all children have been visited.



5. EXPERIMENTS 

In this section, we report the performance of the PrELCA algorithm
in terms of effectiveness, time and space cost, and scalability.
All experiments are done on a laptop with 2.27GHz Intel Pentium
4 CPU and 3GB memory. The operating system is Windows 7, and the
code is written in Java.

5.1 Datasets and Queries 

Two real-life datasets, DBLP¹ and Mondial², and one synthetic
benchmark dataset, XMark³, have been used. We also generate four
test datasets with sizes 10M, 20M, 40M, and 80M for the XMark data.
The three types of datasets are chosen for the following typical
features: DBLP is a large, shallow dataset; Mondial is a deep,
complex, but small dataset; XMark is a balanced dataset, and users
can define different depths and sizes to mimic various types of
documents.

For each dataset, we generate a corresponding probabilistic XML
tree, using the same method as in [17]. Specifically, we traverse
the original document in preorder, and for each visited node v, we
randomly generate some distributional nodes of type "IND" or "MUX"
as children of v. Then, from the original children of v, we choose
some to be the children of the newly generated distributional nodes
and assign random probability distributions to these children, with
the restriction that the probability sum under a MUX node is no
greater than 1. For each dataset, the percentages of IND and MUX
nodes are each controlled at around 30% of the total nodes. We also
randomly select some terms and construct five keyword queries for
each dataset, as shown in Table 1.
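
The MUX restriction mentioned above can be illustrated with a small
Python sketch. It is not the generator of [17]; the function name
random_edge_probs, the rescaling scheme, and the sampling ranges are
assumptions made purely for illustration. The only property taken from
the text is that probabilities under a MUX node sum to at most 1, while
children of an IND node are drawn independently.

```python
import random
from typing import List


def random_edge_probs(kind: str, n_children: int, rng: random.Random) -> List[float]:
    """Draw edge probabilities for the children of a distributional node."""
    if kind == "IND":
        # Independent children: each probability is drawn on its own.
        return [rng.uniform(0.0, 1.0) for _ in range(n_children)]
    if kind == "MUX":
        # Mutually exclusive children: rescale raw weights so that the
        # probabilities sum to a randomly chosen budget of at most 1.
        raw = [rng.uniform(0.01, 1.0) for _ in range(n_children)]
        budget = rng.uniform(0.0, 1.0)
        total = sum(raw)
        return [budget * r / total for r in raw]
    raise ValueError(f"unknown distributional node type: {kind}")
```

Passing an explicitly seeded random.Random instance keeps a generated
probabilistic document reproducible across runs.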



Table 1: Keyword Queries for Each Dataset

ID   Keyword Query                       ID   Keyword Query
X1   United States, Graduate             X2   United States, Credit, Ship
X3   Check, Ship                         X4   Alexas, Ship
X5   Internationally, Ship
M1   Muslim, Multiparty                  M2   City, Area
M3   United States, Islands              M4   Government, Area
M5   Chinese, Polish
D1   Information, Retrieval, Database    D2   XML, Keyword, Query
D3   Query, Relational, Database         D4   probabilistic, Query
D5   stream, Query



In Sections 5.2 and 5.3, we will compare the PrELCA algorithm with
a counterpart algorithm, PrStack [11]. We refer to PrStack as PrSLCA
for the sake of antithesis. PrStack is an algorithm to find
probabilistic SLCA elements from a probabilistic XML document. In
Section 5.2, we will compare search result confidence
(probabilities) under the two semantics. In Section 5.3, we will
report the run-time performance of both algorithms.

¹ http://dblp.uni-trier.de/xml/
² http://www.dbis.informatik.uni-goettingen.de/Mondial/XML
³ http://monetdb.cwi.nl/xml/

5.2 Evaluation of Effectiveness 



Table 2: Comparison of ELCA and SLCA

Queries@Mondial    Max      Min      Avg     Overlap
M1    ELCA         0.816    0.426    0.55    60%
      SLCA         0.703    0.072    0.23
M2    ELCA         1.000    0.980    0.99    100%
      SLCA         1.000    0.980    0.99
M3    ELCA         0.788    0.304    0.45    40%
      SLCA         0.582    0.073    0.13
M4    ELCA         0.730    0.100    0.42    20%
      SLCA         0.180    0.014    0.08
M5    ELCA         1.000    0.890    0.94    90%
      SLCA         1.000    0.840    0.90

Queries@XMark      Max      Min      Avg     Overlap
X1    ELCA         0.560    0.165    0.27    20%
      SLCA         0.209    0.054    0.15
X2    ELCA         0.789    0.353    0.54    50%
      SLCA         0.697    0.153    0.22
X3    ELCA         0.970    0.553    0.62    30%
      SLCA         0.750    0.370    0.51
X4    ELCA         0.716    0.212    0.34    20%
      SLCA         0.236    0.014    0.13
X5    ELCA         0.735    0.525    0.62    0%
      SLCA         0.163    0.044    0.08


Table 2 shows a comparison of probabilistic ELCA results and
probabilistic SLCA results when we run the queries over the Mondial
dataset and the XMark 20MB dataset. For each query and dataset pair,
we select the top-10 results (with the highest probabilities), and
record the maximum, minimum, and average probabilities of the top-10
results. We also count how many results are shared between the
results returned under the two semantics.
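
The statistics reported in Table 2 can be computed along the following
lines. This Python sketch is an assumption-laden illustration rather
than the evaluation code used in the experiments: results are assumed
to be a non-empty mapping from a node identifier to its probability, the
helper names top_k, summarize and overlap are hypothetical, and Overlap
is assumed to mean the fraction of the top-k results shared by the two
semantics.

```python
from typing import Dict, Tuple


def top_k(results: Dict[str, float], k: int = 10) -> Dict[str, float]:
    """Keep the k results with the highest probabilities."""
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def summarize(results: Dict[str, float], k: int = 10) -> Tuple[float, float, float]:
    """Maximum, minimum and average probability of the top-k results."""
    probs = list(top_k(results, k).values())
    return max(probs), min(probs), sum(probs) / len(probs)


def overlap(elca: Dict[str, float], slca: Dict[str, float], k: int = 10) -> float:
    """Fraction of the top-k results returned under both semantics."""
    shared = set(top_k(elca, k)) & set(top_k(slca, k))
    return len(shared) / k
```

For instance, two top-10 lists that share six node identifiers yield an
overlap of 0.6 under this definition.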

For some queries, such as M2 and M5, ELCA results are almost the
same as SLCA results (see the Overlap column), but in most cases,
ELCA results and SLCA results are different. Query X5 on XMark
even returns totally different results for the two semantics. For
the other queries, at least 20% of the results are shared by the two
semantics. After examining the returned results, we find that, most
of the time, the PrELCA algorithm does not miss high-ranked results
returned by PrSLCA. The reason is that, in an ordinary document,
SLCAs are also ELCAs, so probabilistic SLCAs are also probabilistic
ELCAs. A node with a high SLCA probability therefore has an ELCA
probability that is at least as high.

One interesting feature is that, compared with SLCA results,
ELCA results always have higher probabilities (except for some
queries returning similar results, like M2 and M5). For queries M1,
M3, and M4 on the Mondial dataset, the average probability of ELCA
results ranges from 0.42 to 0.55, while that of SLCA results ranges
from 0.08 to 0.23. On the XMark dataset, we have a similar story,
with average ELCA probability from 0.27 to 0.62 and average SLCA
probability from 0.08 to 0.51. ELCA results also have higher Max and
Min values. Since the probability reflects the likelihood that a
node exists among all possible worlds as an ELCA or an SLCA, it is
desirable that returned results have higher probability (or we say
confidence). From this point of view, ELCA results are better than
SLCA results, because they have higher existence probabilities.
Moreover, the Max probabilities of ELCA results are usually high,
above 0.5 in all query cases, but for some queries, such as M4 and
X5, the Max probabilities of SLCA results are below 0.2. If a user
issues a threshold query asking for results with probability higher
than 0.4, there will be no result under SLCA semantics for these
queries, but ELCA semantics still gives non-empty results. This
could be a reason to use ELCA semantics to return keyword query
results.

For the DBLP dataset, we have not listed the results due to space
limitations, but it is not difficult to understand that probabilistic
ELCA results and probabilistic SLCA results are very similar on the
DBLP dataset, since it is a flat and shallow dataset.

5.3 Evaluation of Time Cost and Space Cost 

Fig. 7 shows the time and space cost when we run the queries
X1-X5 on Doc2, M1-M5 on Doc5, and D1-D5 on Doc6. From
Fig. 7(a), 7(c), and 7(e), we can see that both algorithms, PrELCA
and PrSLCA, are efficient. Although ELCA semantics is more complex
than SLCA semantics, the PrELCA algorithm performs similarly to the
PrSLCA algorithm in terms of time cost. The reason may be that both
PrELCA and PrSLCA are stack-based algorithms and access keyword
inverted lists in a similar manner. The PrELCA algorithm is slightly
slower than PrSLCA in most cases, which is acceptable, because ELCA
semantics is more complex and needs more computation. The gap is not
large, reflecting that PrELCA is a competent algorithm if users would
like to know probabilistic ELCAs rather than probabilistic SLCAs.
From Fig. 7(b), 7(d), and 7(f), we can see that PrELCA consumes more
memory than PrSLCA. This is because, besides the keyword
distribution tables which are used in both algorithms, PrELCA has
to maintain some other intermediate results to compute the final
ELCA probabilities, such as the intermediate table mentioned in
Equations 12 and 13 in Section 3.2.

5.4 Evaluation of Scalability 

In this section, we use the XMark dataset to test the scalability of
the PrELCA algorithm. We test two queries, X1 and X2, on the XMark
dataset ranging from 10M to 80M. Fig. 8(a) shows that the time cost
of both queries goes up moderately as the size of the dataset
increases. Fig. 8(b) shows that the space cost has a similar trend
to the time cost as the document size increases. The experiment
shows that, for various keyword queries, the PrELCA algorithm scales
well on different documents, although different queries may consume
different amounts of memory and run for different lengths of time,
due to the different lengths of the inverted lists.

6. RELATED WORK 

There are two streams of work related to ours: probabilistic
XML data management and keyword search on ordinary XML
documents.

Uncertain data management has recently drawn the attention of the
database research community, for both structured and semi-structured
data. In the XML context, the first probabilistic XML model is
ProTDB [2]. In ProTDB, two new types of nodes are added into a plain
XML document: IND describes independent children and MUX describes
mutually-exclusive children. Correspondingly, answering a twig query
on a probabilistic XML document means finding a set of results that
match the twig pattern, each with an existence probability. Hung et
al. [3] modeled probabilistic XML documents as directed acyclic
graphs, explicitly specifying probability distribution over child
nodes. In [4], probabilities are defined as intervals, not points.
Keulen et al. [5] introduced how to use probabilistic XML in data
integration. Their model is a simple one, considering only
mutually-exclusive sub-elements. Abiteboul and Senellart [6]
proposed a "fuzzy trees" model, where the existence of the nodes in
the probabilistic XML document is defined by conjunctive events.
They also gave a full complexity analysis of querying and updating
on the "fuzzy trees" in [1]. In [7], Abiteboul et al. summarized all
the probabilistic XML models in one framework, and studied the
expressiveness of and translations between different models. ProTDB
is represented as PrXML^{ind,mux} in their framework. Cohen et
al. [19] incorporated a set of constraints to express more complex
dependencies among the probabilistic data. They also proposed
efficient algorithms to solve the constraint-satisfaction, query
evaluation, and sampling problems under a set of constraints. On
querying probabilistic XML data, twig query evaluation without index
(node lists) and with index are considered in [20] and [10],
respectively. Chang et al. [9] addressed a more complex situation
where result weight is also considered. The closest work to ours is
[11]. Compared to the SLCA semantics in [11], we studied a more
complex but reasonable semantics, the ELCA semantics.

Keyword search on ordinary XML documents has been extensively
investigated in the past few years. Keyword search results are
usually considered as fragments from the XML document. Most
works use LCA (lowest common ancestor) semantics to find a set
of fragments, each of which contains all the keywords. These
semantics include ELCA [12, 13, 14], SLCA [15, 16], MLCA [21],
and Interconnection Relationship [22]. Other LCA-based query
result semantics rely more or less on SLCA or ELCA by either
imposing further conditions on the LCA nodes [23] or refining the
subtrees rooted at the LCA nodes [24, 25, 26]. The works [27]
and [28] utilize statistics of the underlying XML data to identify
possible query results. All the above works consider deterministic
XML trees. Algorithms on deterministic documents cannot be
directly used on probabilistic documents because, on probabilistic
XML documents, a node may or may not appear; as a result, a node
may be an LCA in one possible world, but not in another. How
to compute the LCA probability for a node is a further challenge.

7. CONCLUSIONS 

In this paper, we have studied keyword search on probabilistic
XML documents. The probabilistic XML data follows a popular
probabilistic XML model, PrXML^{ind,mux}. We have defined
probabilistic ELCA semantics for a keyword query on a probabilistic
XML document in terms of possible world semantics. A stack-based
algorithm, PrELCA, has been proposed to find probabilistic ELCAs
and their ELCA probabilities without generating possible worlds. We
have conducted extensive experiments to test the performance of the
PrELCA algorithm in terms of effectiveness, time and space cost, and
scalability. We have compared the results with a previous SLCA-based
algorithm. The experiments have shown that ELCA semantics gives
better keyword query results with only a slight performance
sacrifice.